A New Sequential Mining Approach to XML Document Similarity Computation

Authors

Ho-pong Leung

Korris Fu-Lai Chung

Stephen Chi-fai Chan

Abstract

Measuring the structural similarity among XML documents is the task of finding their semantic correspondence and is fundamental to many web-based applications. While several methods exist to address the problem, the data mining approach appears to be a novel, interesting and promising one. It works by extracting paths from XML documents, encoding them as sequences, and finding the maximal frequent sequences using sequential pattern mining algorithms. Because ignoring the hierarchical information when encoding the paths for mining leads to deficiencies, this paper proposes a new sequential pattern mining scheme for XML document similarity computation. It uses a preorder tree representation (PTR) to encode the XML tree's paths so that both the semantics of the elements and the hierarchical structure of the document are taken into account when computing the structural similarity among documents. In addition, it includes a post-processing step that reuses the mined patterns to estimate the similarity of unmatched elements, so that another metric to quantify the similarity between XML documents can be introduced. Encouraging experimental results were obtained and are reported.

Note: Manuscript submitted to Postgraduate Research Day.
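The abstract's point about hierarchical information can be illustrated with a minimal sketch. This is not the paper's PTR encoding, whose details are not given here. It assumes, purely for illustration, that each path symbol is either a bare tag name or a `(depth, tag)` pair, and it compares paths with a plain subsequence test rather than a full sequential pattern mining algorithm. The helper names `encode_paths` and `is_subsequence` are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode_paths(xml_text, with_levels=True):
    """Return every root-to-leaf path of the document, visited in preorder.

    Each path is a tuple of symbols. With with_levels=True a symbol is a
    (depth, tag) pair (an illustrative assumption, not the paper's PTR);
    otherwise it is just the tag name.
    """
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix, depth):
        symbol = (depth, node.tag) if with_levels else node.tag
        current = prefix + (symbol,)
        children = list(node)
        if not children:
            paths.append(current)  # leaf reached: record the full path
        for child in children:
            walk(child, current, depth + 1)

    walk(root, (), 0)
    return paths


def is_subsequence(short, long):
    """True if the symbols of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(symbol in remaining for symbol in short)


doc_a = "<book><title/></book>"
doc_b = "<book><chapter><title/></chapter></book>"

# Tag-only encoding: book/title looks like a subsequence of book/chapter/title.
plain_match = is_subsequence(
    encode_paths(doc_a, with_levels=False)[0],
    encode_paths(doc_b, with_levels=False)[0],
)

# Level-aware encoding: title sits at depth 1 in doc_a but depth 2 in doc_b.
level_match = is_subsequence(encode_paths(doc_a)[0], encode_paths(doc_b)[0])
```

In this sketch the tag-only encoding reports a match even though `title` sits at different depths in the two documents, while the level-aware encoding does not. That is the kind of mismatch a hierarchy-aware encoding is meant to expose.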


Similar resources

Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity

As the number of XML documents grows, organizing them effectively so that useful information can be retrieved from them is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...


The Process and Applications of XML Data Mining

XML has gained popularity for information representation, exchange and retrieval. As XML material becomes more abundant, its heterogeneity and structural irregularity limit the knowledge that can be gained. The utilisation of data mining techniques becomes essential for improvement in XML document handling. This chapter presents the capabilities and benefits of data mining techniques in the XML...


Edit Distance between XML and Probabilistic XML Documents

Probabilistic XML is a hierarchical data model capturing uncertainty of both value and structure. The ability to compute the similarity between an XML document and a probabilistic XML document is a building block of many applications involving querying, comparison, alignment and classification, for instance. The new challenge in efficiently computing such similarity is the multiplicity of the p...


Mining Sequential Trees in a Tree Sequence Database

Tree structures are used extensively in domains such as XML data management, web log analysis, biological computing, and so on. In this paper we introduce the problem of mining frequent sequential trees in a large tree sequence database. We present a framework for mining frequent sequential trees in a so-called tree sequence database. Basically, this framework employs a transformation-based app...


Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...



Publication date: 2003
